Understanding and retrieving information from long egocentric videos is challenging because the footage is unstructured and contains complex temporal events. This work presents a comparative study of two Temporal Question Answering (Temporal-QA) frameworks designed for egocentric daily-activity videos. The first model employs a single-pass audio-visual framework with Gaussian Contrastive Grounding (GCG) to localize relevant temporal segments and generate answers through multimodal reasoning. While effective, this approach is limited by sparse temporal sampling, which can miss short-duration events. The second model introduces an enhanced Coarse-to-Fine (C2F) hierarchical grounding strategy: it performs a global coarse scan to identify candidate segments, followed by a fine-grained local analysis to capture detailed temporal information. It also extends the grounding mechanism with a Gaussian Mixture Model (GMM) and improves multimodal integration through bidirectional audio-visual fusion. Both models are evaluated on the NExT-QA dataset using metrics such as accuracy and reasoning capability. Experimental results show that the C2F model significantly outperforms the single-pass baseline, achieving improved temporal localization and better handling of complex causal and short-duration events.
Introduction
Egocentric videos from wearable devices produce large volumes of unstructured, long-duration first-person footage that is difficult to analyze manually. To address this, this work focuses on Temporal Question Answering (TQA), in which users query videos in natural language and retrieve answers based on both visual content and temporal relationships.
Recent advances in multimodal learning and large multimodal models have improved Video Question Answering, especially through temporal grounding, which links a query to specific time segments in a video. However, existing methods have limitations: Single-Pass approaches are efficient but may miss fine-grained or short events because they sample frames sparsely and uniformly, while Coarse-to-Fine hierarchical methods improve precision at the cost of additional computation.
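The sampling limitation described above can be illustrated with a small sketch. It is not taken from either framework: the video length, event position, frame budget, and candidate window are hypothetical numbers, and the candidate window is simply assumed to have been produced by a coarse stage that is not modelled here.

```python
def uniform_timestamps(start, end, num_frames):
    """Return num_frames timestamps (seconds) at the centres of
    equal-width bins over [start, end)."""
    step = (end - start) / num_frames
    return [start + (i + 0.5) * step for i in range(num_frames)]


def hits_event(timestamps, event_start, event_end):
    """True if at least one sampled timestamp falls inside the event."""
    return any(event_start <= t < event_end for t in timestamps)


# Hypothetical numbers chosen for illustration only.
VIDEO_LENGTH = 600.0    # a 10-minute clip
EVENT = (301.0, 303.0)  # a 2-second event
FRAME_BUDGET = 32       # frames sampled per pass

# Single pass: one sparse scan over the whole video
# (one frame every 18.75 seconds).
single_pass = uniform_timestamps(0.0, VIDEO_LENGTH, FRAME_BUDGET)

# Second pass of a coarse-to-fine scheme: the same budget is spent on a
# candidate window. The window is an assumption standing in for the
# output of a coarse stage.
CANDIDATE_WINDOW = (290.0, 320.0)
fine_pass = uniform_timestamps(*CANDIDATE_WINDOW, FRAME_BUDGET)

print("single pass sees event:", hits_event(single_pass, *EVENT))
print("fine pass sees event:  ", hits_event(fine_pass, *EVENT))
print("frames processed:", len(single_pass),
      "vs", len(single_pass) + len(fine_pass))
```

With these numbers the sparse scan steps over the 2-second event, while the denser scan of the candidate window lands inside it, at the price of processing twice as many frames. This is the efficiency versus precision trade-off in miniature.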
This work presents a comparative study of these two TQA frameworks rather than proposing a new model. It analyzes their architectures, grounding strategies, and multimodal integration, and evaluates them on the NExT-QA benchmark using metrics like accuracy and reasoning ability. The goal is to highlight trade-offs between simplicity vs. complexity and efficiency vs. precision in temporal video reasoning.
The related work section reviews:
The evolution of VideoQA from RNN-based models to Transformer-based and large multimodal models.
The importance of temporal grounding and of weak supervision methods such as Gaussian Contrastive Grounding (GCG).
Hierarchical Coarse-to-Fine approaches for better temporal localization.
The growing role of audio-visual learning for improved contextual understanding.
The methodology describes a Single-Pass GCG-based framework for TQA:
Inputs: egocentric video + natural language question.
Feature extraction: visual features via EVA-CLIP, audio features via Whisper, and text features via Flan-T5.
Temporal alignment of modalities into a shared space.
Weak supervision using pseudo-labels derived from cosine similarity.
Gaussian Contrastive Grounding to model temporal relevance as a Gaussian distribution over time for locating relevant video segments.
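The last two steps can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the framework's implementation: the 2-D vectors are toy stand-ins for the aligned frame and question embeddings, the function names are hypothetical, and the Gaussian is simply centred on the pseudo-label with a hand-picked width. How the actual model obtains its Gaussian parameters, and the contrastive objective it trains with, are not described in this summary and are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pseudo_label(frame_feats, question_feat):
    """Pick the frame most similar to the question as a weak (pseudo) label."""
    sims = [cosine(f, question_feat) for f in frame_feats]
    peak = max(range(len(sims)), key=lambda i: sims[i])
    return peak, sims


def gaussian_weights(num_frames, mu, sigma):
    """Normalised Gaussian relevance weights over frame indices 0..num_frames-1."""
    raw = [math.exp(-0.5 * ((t - mu) / sigma) ** 2) for t in range(num_frames)]
    total = sum(raw)
    return [r / total for r in raw]


# Toy 2-D features standing in for aligned frame and question embeddings.
frame_feats = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
question_feat = [0.0, 1.0]

peak, sims = pseudo_label(frame_feats, question_feat)

# Assumption: centre the Gaussian on the pseudo-label, with a fixed width.
weights = gaussian_weights(len(frame_feats), mu=peak, sigma=1.0)

print("similarities:", [round(s, 2) for s in sims])
print("pseudo-label frame:", peak)
print("gaussian weights:", [round(w, 3) for w in weights])
```

The weights sum to one and peak at the pseudo-labelled frame, so frames near the most question-relevant moment contribute most to whatever is computed downstream, while distant frames are smoothly down-weighted rather than discarded.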
Conclusion
This work examined Temporal Question Answering for egocentric video understanding through a detailed comparative analysis of a Single-Pass Gaussian Contrastive Grounding (GCG) model and a Two-Pass Coarse-to-Fine (C2F) model. The study demonstrates that while the Single-Pass approach provides a computationally efficient baseline, it suffers from limitations in temporal precision, multimodal alignment, and contextual reasoning.
Through extensive experimental analysis, including quantitative evaluation and qualitative comparisons, the Two-Pass C2F model consistently outperforms the Single-Pass model across multiple dimensions. The hierarchical grounding strategy enables more accurate localization of relevant temporal segments, effectively addressing the frame dilution problem in longer videos. Additionally, the integration of audio-visual features enhances the model’s ability to capture both visible and non-visible events, leading to improved contextual understanding.
The qualitative results further validate these findings, showing that the Two-Pass model produces more precise and context-aware responses, particularly for action-based and reasoning-intensive queries. The inclusion of explicit temporal grounding also improves interpretability, allowing users to trace answers back to specific video segments, thereby increasing transparency and reliability.
Despite these improvements, the Two-Pass model introduces additional computational overhead compared to the Single-Pass approach. This highlights a trade-off between efficiency and performance, suggesting that model selection should be guided by application requirements.
The proposed framework provides a robust foundation for real-world video question answering systems, particularly in applications such as assistive technologies, intelligent video analysis, and memory augmentation systems. Future work will focus on optimizing the model for real-time deployment, improving reasoning capabilities, and extending the framework to incorporate additional modalities for richer contextual understanding.